import numpy as np
from sklearn import datasets, linear_model, metrics
from sklearn.linear_model import LinearRegression
import plotly.express as px
import pandas as pd
Ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) were obtained for each of n = 442 diabetes patients, as well as the response of interest: a quantitative measure of disease progression one year after baseline.
Data Set Characteristics:
- Number of Instances:
442
- Number of Attributes:
First 10 columns are numeric predictive values
- Target:
Column 11 is a quantitative measure of disease progression one year after baseline
- Attribute Information:
  - age: age in years
  - sex
  - bmi: body mass index
  - bp: average blood pressure
  - s1: tc, total serum cholesterol
  - s2: ldl, low-density lipoproteins
  - s3: hdl, high-density lipoproteins
  - s4: tch, total cholesterol / HDL
  - s5: ltg, possibly log of serum triglycerides level
  - s6: glu, blood sugar level
Note: Each of these 10 feature variables has been mean-centered and scaled by the standard deviation times the square root of n_samples (i.e. the sum of squares of each column totals 1).
Source URL: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see: Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) “Least Angle Regression,” Annals of Statistics (with discussion), 407-499. (https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
For more information, see the original website below:
https://scikit-learn.org/stable/datasets/toy_dataset.html
X,Y=datasets.load_diabetes(return_X_y=True)
Dataset shape
Y.shape
(442,)
X.shape
(442, 10)
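The scaling described in the dataset note can be verified directly: each column should be mean-centered and have a sum of squares equal to 1. A quick sanity check:

```python
import numpy as np
from sklearn import datasets

X, Y = datasets.load_diabetes(return_X_y=True)

# Each column should be mean-centered with unit sum of squares
col_means = X.mean(axis=0)
col_sq_sums = (X ** 2).sum(axis=0)

print(np.allclose(col_means, 0.0))    # True (up to floating-point error)
print(np.allclose(col_sq_sums, 1.0))  # True
```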
Organize data in Pandas DataFrame
inp=['x'+str(i) for i in range(X.shape[1])]
inp+['y']
['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'y']
df=pd.concat([pd.DataFrame(X,columns=inp),pd.DataFrame(Y,columns=['y'])],axis=1)
df.head(3)
|   | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005670 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.025930 | 141.0 |
Explanation of each column of the dataframe
string='''age age in years
sex
bmi body mass index
bp average blood pressure
s1 tc, total serum cholesterol
s2 ldl, low-density lipoproteins
s3 hdl, high-density lipoproteins
s4 tch, total cholesterol / HDL
s5 ltg, possibly log of serum triglycerides level
s6 glu, blood sugar level
quantitative measure of disease progression one year after baseline'''
print(pd.DataFrame(zip(inp+['y'],string.split('\n')),columns=['column','meaning']).to_markdown(index=False))
| column   | meaning                                                              |
|:---------|:---------------------------------------------------------------------|
| x0       | age age in years                                                     |
| x1       | sex                                                                  |
| x2       | bmi body mass index                                                  |
| x3       | bp average blood pressure                                            |
| x4       | s1 tc, total serum cholesterol                                       |
| x5       | s2 ldl, low-density lipoproteins                                     |
| x6       | s3 hdl, high-density lipoproteins                                    |
| x7       | s4 tch, total cholesterol / HDL                                      |
| x8       | s5 ltg, possibly log of serum triglycerides level                    |
| x9       | s6 glu, blood sugar level                                            |
| y        | quantitative measure of disease progression one year after baseline  |
Correlation of each input column with the output y
px.bar(df.corr()['y'][inp])
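As a cross-check, the same correlations can be computed with plain NumPy via np.corrcoef; bmi (x2) should come out as the single feature most correlated with disease progression. A sketch:

```python
import numpy as np
from sklearn import datasets

X, Y = datasets.load_diabetes(return_X_y=True)

# Pearson correlation of each feature column with the target
corr = np.array([np.corrcoef(X[:, j], Y)[0, 1] for j in range(X.shape[1])])
print(corr.round(3))
print(corr.argmax())  # 2 -> bmi (x2) is the strongest single predictor
```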
Generate descriptive statistics of the dataframe
df.describe()
|   | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 442.000000 |
| mean | -2.511817e-19 | 1.230790e-17 | -2.245564e-16 | -4.797570e-17 | -1.381499e-17 | 3.918434e-17 | -5.777179e-18 | -9.042540e-18 | 9.293722e-17 | 1.130318e-17 | 152.133484 |
| std | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 77.093005 |
| min | -1.072256e-01 | -4.464164e-02 | -9.027530e-02 | -1.123988e-01 | -1.267807e-01 | -1.156131e-01 | -1.023071e-01 | -7.639450e-02 | -1.260971e-01 | -1.377672e-01 | 25.000000 |
| 25% | -3.729927e-02 | -4.464164e-02 | -3.422907e-02 | -3.665608e-02 | -3.424784e-02 | -3.035840e-02 | -3.511716e-02 | -3.949338e-02 | -3.324559e-02 | -3.317903e-02 | 87.000000 |
| 50% | 5.383060e-03 | -4.464164e-02 | -7.283766e-03 | -5.670422e-03 | -4.320866e-03 | -3.819065e-03 | -6.584468e-03 | -2.592262e-03 | -1.947171e-03 | -1.077698e-03 | 140.500000 |
| 75% | 3.807591e-02 | 5.068012e-02 | 3.124802e-02 | 3.564379e-02 | 2.835801e-02 | 2.984439e-02 | 2.931150e-02 | 3.430886e-02 | 3.243232e-02 | 2.791705e-02 | 211.500000 |
| max | 1.107267e-01 | 5.068012e-02 | 1.705552e-01 | 1.320436e-01 | 1.539137e-01 | 1.987880e-01 | 1.811791e-01 | 1.852344e-01 | 1.335973e-01 | 1.356118e-01 | 346.000000 |
Box plot of the output
fig = px.box(df, y="y", height=400, width=400,)
fig.show()
This section explains a linear regression (LR) implementation using only NumPy, followed by a comparison with the LinearRegression of the sklearn library.
y=Y.reshape((len(Y),1))
y.shape
(442, 1)
Prepending a column of ones to X turns the affine equation y = aX + b into the purely linear form y = XB, with the design matrix X defined as below:
$ {\displaystyle \mathbf {X} ={\begin{bmatrix}\mathbf {x} _{1}^{\mathsf {T}}\\\mathbf {x} _{2}^{\mathsf {T}}\\\vdots \\\mathbf {x} _{n}^{\mathsf {T}}\end{bmatrix}}={\begin{bmatrix}1&x_{11}&\cdots &x_{1p}\\1&x_{21}&\cdots &x_{2p}\\\vdots &\vdots &\ddots &\vdots \\1&x_{n1}&\cdots &x_{np}\end{bmatrix}},} $
UN=np.ones((X.shape[0],1))
Xu=np.concatenate(( UN,X), axis=1)
X[0:3,0:3]
array([[ 0.03807591, 0.05068012, 0.06169621],
[-0.00188202, -0.04464164, -0.05147406],
[ 0.08529891, 0.05068012, 0.04445121]])
Xu[0:3,0:4]
array([[ 1. , 0.03807591, 0.05068012, 0.06169621],
[ 1. , -0.00188202, -0.04464164, -0.05147406],
[ 1. , 0.08529891, 0.05068012, 0.04445121]])
X.shape
(442, 10)
Xu.shape
(442, 11)
By hand
# Use a zero matrix just to verify the shapes
B=np.zeros((Xu.shape[1],y.shape[1]))
B.shape
(11, 1)
Xu.dot(B).shape , y.shape
((442, 1), (442, 1))
Now putting the independent and dependent variables in matrices *X* and *Y* respectively, the loss function can be rewritten as:

$ {\displaystyle L\left(D,{\vec {\beta }}\right)=\left(X{\vec {\beta }}-Y\right)^{\textsf {T}}\left(X{\vec {\beta }}-Y\right)=Y^{\textsf {T}}Y-Y^{\textsf {T}}X{\vec {\beta }}-{\vec {\beta }}^{\textsf {T}}X^{\textsf {T}}Y+{\vec {\beta }}^{\textsf {T}}X^{\textsf {T}}X{\vec {\beta }}} $

Loss calculation
loss=(Xu.dot(B)-y).T.dot(Xu.dot(B)-y)
loss.shape
(1, 1)
Remove the extra dimension
np.squeeze(loss)
array(12850921.)
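Since B is the zero matrix here, every prediction is 0 and the loss collapses to yᵀy, the sum of squared targets. A quick sanity check of the value above:

```python
import numpy as np
from sklearn import datasets

X, Y = datasets.load_diabetes(return_X_y=True)
y = Y.reshape(-1, 1)

# With B = 0, Xu.dot(B) - y = -y, so the loss is simply y.T @ y
loss_at_zero = (y.T @ y).item()
print(loss_at_zero)  # 12850921.0
```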
As the loss is convex, the optimum solution lies at gradient zero. The gradient of the loss function is (using the denominator-layout convention):
$ {\displaystyle {\begin{aligned}{\frac {\partial L\left(D,{\vec {\beta }}\right)}{\partial {\vec {\beta }}}}&={\frac {\partial \left(Y^{\textsf {T}}Y-Y^{\textsf {T}}X{\vec {\beta }}-{\vec {\beta }}^{\textsf {T}}X^{\textsf {T}}Y+{\vec {\beta }}^{\textsf {T}}X^{\textsf {T}}X{\vec {\beta }}\right)}{\partial {\vec {\beta }}}}\\&=-2X^{\textsf {T}}Y+2X^{\textsf {T}}X{\vec {\beta }}\end{aligned}}} $
dloss=-2*Xu.T.dot(y)+2*Xu.T.dot(Xu).dot(B)
dloss.shape,B.shape
((11, 1), (11, 1))
Setting the gradient to zero produces the optimum parameter:
${\displaystyle {\begin{aligned}-2X^{\textsf {T}}Y+2X^{\textsf {T}}X{\vec {\beta }}&=0\\\Rightarrow X^{\textsf {T}}X{\vec {\beta }}&=X^{\textsf {T}}Y\\\Rightarrow {\vec {\hat {\beta }}}&=\left(X^{\textsf {T}}X\right)^{-1}X^{\textsf {T}}Y\end{aligned}}}$
# Optimal B that give a dLoss/dB(Bopt)=0
Bopt=np.linalg.inv(Xu.T.dot(Xu)).dot(Xu.T.dot(y))
Bopt.shape
(11, 1)
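Explicitly inverting XᵀX works here, but it is numerically fragile in general; np.linalg.solve (or np.linalg.lstsq) is the usual alternative. A sketch reproducing Bopt three ways:

```python
import numpy as np
from sklearn import datasets

X, Y = datasets.load_diabetes(return_X_y=True)
y = Y.reshape(-1, 1)
Xu = np.hstack([np.ones((X.shape[0], 1)), X])

# Normal equations via explicit inverse (as above) ...
B_inv = np.linalg.inv(Xu.T @ Xu) @ (Xu.T @ y)

# ... via solving the linear system directly (better conditioned) ...
B_solve = np.linalg.solve(Xu.T @ Xu, Xu.T @ y)

# ... or via least squares on Xu itself (never forms Xu.T @ Xu)
B_lstsq, *_ = np.linalg.lstsq(Xu, y, rcond=None)

print(np.allclose(B_inv, B_solve), np.allclose(B_inv, B_lstsq))
```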
dLoss/dB at Bopt is close to 0
dloss=-2*Xu.T.dot(y)+2*Xu.T.dot(Xu).dot(Bopt)
np.sum(dloss)
2.2652102416031994e-11
df['y_hat']=Xu.dot(Bopt)
px.line(df,y=['y','y_hat'])
class linear_regression():
    def __init__(self):
        self.B = 0

    def add_one(self, X):
        # Prepend a column of ones to X (intercept term)
        UN = np.ones((X.shape[0], 1))
        return np.concatenate((UN, X), axis=1)

    def fit(self, X, Y):
        # Compute the optimal B from the normal equations
        Xu = self.add_one(X)
        Bopt = np.linalg.inv(Xu.T.dot(Xu)).dot(Xu.T.dot(Y))
        self.B = Bopt

    def predict(self, X):
        # Predict Y for the given X
        B = self.B
        Xu = self.add_one(X)
        y_hat = Xu.dot(B)
        return y_hat

    def loss(self, X, Y):
        # Compute the squared-error loss for the current B
        B = self.B
        Xu = self.add_one(X)
        loss = (Xu.dot(B) - Y).T.dot(Xu.dot(B) - Y)
        return loss

    def dloss(self, X, Y):
        # Compute the gradient dLoss/dB for the current B
        B = self.B
        Xu = self.add_one(X)
        dloss = -2 * Xu.T.dot(Y) + 2 * Xu.T.dot(Xu).dot(B)
        return dloss
LR=linear_regression()
LR.fit(X,Y)
y_hat2=LR.predict(X)
LR.loss(X,Y)
1263985.7856333433
dLoss/dB is close to 0 at Bopt
np.abs(LR.dloss(X,Y)).sum()
4.3513637137948535e-11
Calculate Y_hat
df['y_hat2']=y_hat2
(df.y_hat-df.y_hat2).abs().sum()
0.0
Comparison of the first y_hat with the second one calculated by the LR object
px.line(df,y=['y_hat','y_hat2'])
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X,Y)
LinearRegression()
y_hat_sklearn=reg.predict(X)
y_hat_sklearn.shape
(442,)
df['y_hat_sklearn']=y_hat_sklearn
sklearn's LinearRegression and the local LR object produce the same output ŷ
px.line(df,y=['y_hat2','y_hat_sklearn'])
The error between both outputs is close to 0
(df.y_hat2-df.y_hat_sklearn).abs().sum()
5.17630383001233e-11
The coefficients of both models are the same; the only difference is that our model stores the intercept as the first element of the B matrix, while sklearn keeps it separately in reg.intercept_
reg.coef_
array([ -10.0098663 , -239.81564367, 519.84592005, 324.3846455 ,
-792.17563855, 476.73902101, 101.04326794, 177.06323767,
751.27369956, 67.62669218])
LR.B[1:]
array([ -10.0098663 , -239.81564367, 519.84592005, 324.3846455 ,
-792.17563855, 476.73902101, 101.04326794, 177.06323767,
751.27369956, 67.62669218])
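Because the features are mean-centered, the fitted intercept equals the mean of the target (152.133…), which is also what sklearn reports as reg.intercept_. A NumPy-only check:

```python
import numpy as np
from sklearn import datasets

X, Y = datasets.load_diabetes(return_X_y=True)
Xu = np.hstack([np.ones((X.shape[0], 1)), X])
B, *_ = np.linalg.lstsq(Xu, Y, rcond=None)

# With mean-centered predictors, the intercept term is just the target mean
print(B[0], Y.mean())
```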
Make a new two-dimensional output from Y, whose second column is 0.5·Y
Y.shape
(442,)
Y2=np.stack((Y,0.5*Y),axis=1)
Y2.shape
(442, 2)
Calculate Ŷ with the sklearn LR model
reg = LinearRegression()
reg.fit(X,Y2)
y_hat_sklearn2=reg.predict(X)
y_hat_sklearn2.shape
(442, 2)
Calculate the Ŷ with the local LR object
LR=linear_regression()
LR.fit(X,Y2)
y2_hat=LR.predict(X)
The error between both outputs is close to 0
np.abs(y_hat_sklearn2-y2_hat).sum()
1.4384227142727468e-10
With a two-column Y, the loss at Bopt becomes a 2×2 matrix
LR.loss(X,Y2)
array([[1263985.78563334, 631992.89281667],
[ 631992.89281667, 315996.44640834]])
dLoss/dB is close to 0 at Bopt
np.abs(LR.dloss(X,Y2)).sum()
7.004530289123068e-11
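The two-column fit also illustrates that OLS is linear in the targets: since the second column of Y2 is 0.5·Y, its coefficient vector is exactly half of the first one. A sketch using lstsq:

```python
import numpy as np
from sklearn import datasets

X, Y = datasets.load_diabetes(return_X_y=True)
Y2 = np.stack((Y, 0.5 * Y), axis=1)
Xu = np.hstack([np.ones((X.shape[0], 1)), X])

# One lstsq call fits both target columns at once: B has shape (11, 2)
B, *_ = np.linalg.lstsq(Xu, Y2, rcond=None)

# B_hat = (Xu^T Xu)^-1 Xu^T Y is linear in Y, so halving Y halves B
print(np.allclose(B[:, 1], 0.5 * B[:, 0]))  # True
```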
A local linear regression (LR) object was implemented in pure NumPy code, and it shows the same behaviour as sklearn's LinearRegression.